Frontiers in Artificial Intelligence
Frontiers Media SA
Preprints posted in the last 7 days, ranked by how well they match Frontiers in Artificial Intelligence's content profile, based on 18 papers previously published here. The average preprint has a 0.03% match score for this journal, so anything above that is already an above-average fit.
Piorkowska, N. J.; Olejnik, A.; Ostromecki, A.; Kuliczkowski, W.; Mysiak, A.; Bil-Lula, I.
Interpreting machine learning models typically relies on feature attribution methods that quantify the contribution of individual variables to model predictions. However, it remains unclear whether attribution magnitude reflects the true functional importance of features for model performance. Here, we present a unified interpretability framework integrating permutation-based attribution, feature ablation, and stability under perturbation across multiple feature spaces. Using nested cross-validation and permutation-based null diagnostics, we systematically evaluate the relationship between attribution magnitude and functional dependence in clinical and biomarker-based prediction models. Attribution magnitude is frequently misaligned with functional importance, with weak to strong negative correlations observed across feature spaces (Spearman ρ ranging from -0.374 to -0.917). Features with high attribution often have limited impact on model performance when removed, whereas features with low attribution can be essential for maintaining predictive accuracy. These discrepancies define distinct classes of interpretability failure, including attribution excess and latent dependence. Interpretability further depends on feature space composition, and stable, functionally relevant features are not necessarily those with the highest attribution scores. By integrating attribution, functional impact, and stability into a composite Feature Reliability Score, we identify features that remain informative across perturbations and analytical contexts. These findings indicate that interpretability does not arise from attribution magnitude alone but is better characterized by stability under perturbation. This framework provides a basis for more robust model interpretation and highlights limitations of attribution-centric approaches in high-dimensional and correlated data settings.
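As a toy illustration of the misalignment this abstract quantifies, the sketch below computes a Spearman rank correlation between per-feature attribution scores and the accuracy lost when each feature is ablated. All numbers are invented for illustration, not taken from the study.

```python
# Toy diagnostic: does attribution magnitude track ablation impact?
# Scores below are hypothetical, not from the paper.

def rankdata(xs):
    """Average 1-based ranks, with ties sharing the mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rho: Pearson correlation of the rank vectors."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

attribution   = [0.90, 0.70, 0.40, 0.20, 0.05]  # per-feature attribution
ablation_drop = [0.01, 0.02, 0.03, 0.08, 0.12]  # accuracy lost when removed

rho = spearman(attribution, ablation_drop)
print(f"Spearman rho = {rho:.3f}")  # -1.000: perfectly inverted ranking
```

A strongly negative rho, as in this contrived example, is exactly the "attribution excess / latent dependence" pattern the abstract describes: the most-attributed features matter least when ablated.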
Lukhele, N.; Mostafa, F.
Objective: To develop and evaluate a novel machine learning (ML) framework tailored to a clinical diabetes dataset and to assess whether demographic stratification enhances model performance and interpretability for multiclass diabetes classification. Methods: A clinical dataset of 264 patient records was used to classify individuals into non-diabetic, prediabetic, and diabetic categories. Several supervised learning models were trained using an 80:20 train-test split and optimized using RandomizedSearchCV and 10-fold cross-validation. Model performance was evaluated using accuracy, precision, recall, and the F1-score. Area under the receiver operating characteristic curve (AUC) was calculated for the best-generalizing model. A structured ML framework was developed for this dataset, incorporating preprocessing, model optimization, and stratified analysis by age (<35 vs >35 years) and gender. SHAP was used for model interpretability. Results: Ensemble methods demonstrated superior performance compared with linear or single-tree approaches, with Gradient Boosting showing the most stable generalization, with a test accuracy of 0.981 and a stable cross-validation accuracy of 0.972. AUC-ROC analysis using Gradient Boosting yielded good discriminative ability across the three diabetes classes: 0.991 (non-diabetic), 0.986 (prediabetic), and 0.972 (diabetic). Stratified analysis showed improved reliability in individuals aged >35 years (accuracy = 0.94, F1-score = 0.92), while performance in younger individuals was unstable due to small sample size. SHAP analysis identified HbA1c, BMI, and age as dominant predictors. Conclusion: This study presents an ML framework integrating age-stratified modelling with explainable ML methods to improve interpretability.
The findings can support clinical decision-making systems, individualized risk assessment, and targeted intervention in diabetes progression.
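The stratified evaluation described above can be sketched as follows. The labels, predictions, and age strata are invented, and `macro_f1` is a simple stand-in for the per-class metrics reported in the abstract.

```python
# Sketch of per-stratum evaluation for a 3-class problem.
# Data are invented; class labels: 0 = non-diabetic, 1 = prediabetic, 2 = diabetic.

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        if tp == 0:
            scores.append(0.0)
            continue
        prec, rec = tp / (tp + fp), tp / (tp + fn)
        scores.append(2 * prec * rec / (prec + rec))
    return sum(scores) / len(scores)

records = [  # (age_group, true, predicted) - invented
    ("over35", 0, 0), ("over35", 1, 1), ("over35", 2, 2), ("over35", 2, 1),
    ("under35", 0, 1), ("under35", 1, 1),
]
for group in ("over35", "under35"):
    yt = [t for g, t, p in records if g == group]
    yp = [p for g, t, p in records if g == group]
    print(group, f"acc={accuracy(yt, yp):.2f}",
          f"macroF1={macro_f1(yt, yp, [0, 1, 2]):.2f}")
```

Note how the tiny "under35" stratum yields unstable metrics, mirroring the abstract's observation that performance in younger individuals suffered from small sample size.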
Gantenberg, J. R.; La Joie, R.; Heston, M. B.; Ackley, S. F.
Qualitative models of Alzheimer's pathology often posit that amyloid accumulation follows a sigmoid curve, indicating that the rate of deposition wanes over time. Longitudinal PET data now allow us to investigate amyloid accumulation trajectories with greater detail and over longer follow-up periods. We combine inferences from simulated amyloid trajectories, empirical PET data from the Alzheimer's Disease Neuroimaging Initiative (ADNI), and the sampled iterative local approximation algorithm (SILA) to assess whether amyloid accumulation reaches a physiologic ceiling. We find that SILA reliably detects a ceiling, when present, across a range of simulated scenarios that impose a sigmoid shape. When fit to empirical data from ADNI, however, SILA does not appear to indicate the presence of a ceiling. Thus, we conclude that amyloid trajectories may not reach a physiologic ceiling during the stages of Alzheimer's disease typically observed while patients remain under follow-up in cohort studies. Fits using SILA indicate that illustrative models of biomarker cascades, while useful tools for conceptualizing and interrogating pathologic processes, may not represent the shapes of amyloid trajectories accurately. Summary for General Public: Amyloid, a protein implicated in Alzheimer's disease, is thought to reach a plateau in the brain, but methods that estimate how amyloid changes over time suggest it grows unabated. Gantenberg et al. use one such method and simulations to argue that amyloid does not reach a plateau during the typical course of Alzheimer's.
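A toy illustration (not the SILA algorithm itself) of why a sigmoid shape implies a detectable ceiling: under logistic growth, the accumulation rate collapses toward zero as the level approaches its asymptote, whereas unbounded growth keeps a substantial rate. Parameters below are arbitrary.

```python
# Simulate a sigmoid (logistic) amyloid trajectory and inspect how the
# accumulation rate behaves near the ceiling. Parameters are arbitrary.
import math

def logistic(t, ceiling=100.0, k=0.5, t0=10.0):
    """Logistic curve: approaches `ceiling` as t grows."""
    return ceiling / (1 + math.exp(-k * (t - t0)))

times = [t * 0.5 for t in range(0, 60)]            # 0 .. 29.5 (arbitrary units)
levels = [logistic(t) for t in times]
# Forward-difference estimate of the rate of change at each time point
rates = [(logistic(t + 0.01) - logistic(t)) / 0.01 for t in times]

peak = max(rates)
print(f"peak rate {peak:.2f}; "
      f"rate at level {levels[-1]:.1f}: {rates[-1]:.4f}")
```

The rate at the top of the curve is a tiny fraction of the peak rate; the abstract's point is that empirical ADNI fits do not show this collapse, suggesting no ceiling is reached during observed follow-up.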
Jiang, S.; Foo, J. C.; Roper, L.; Yang, E.; Green, B.; Arnau, R.; Behavioral Addictions Studies and Insights Consortium; Lodhi, R. J.; Isenberg, R.; Wishart, D. S.; Fujiwara, E.; Carnes, P. J.; Aitchison, K. J.
Objectives: Non-suicidal self-injury (NSSI) and self-harming sexual behaviours share functional and behavioural overlaps. However, the relationship between NSSI and problematic sexual behaviour (PSB) remains underexplored. This study aimed to investigate the association between NSSI and PSB in two cohorts - a non-clinical university cohort and a clinical PSB patient cohort. Methods: Data were collected from 2,189 university participants and 477 clinical PSB patients. NSSI was assessed via self-report, and PSB was measured with the Sexual Addiction Screening Test-Revised (SAST-R) Core. The four core addictive dimensions of PSB (relationship disturbance, loss of control, preoccupation, and affect disturbance) were also evaluated. Logistic regression analyses were conducted to examine the association between PSB (presence/absence and severity) and NSSI, looking at effects of gender and contributions of addictive dimensions of PSB. Results: Rates of NSSI were similar in the university (7.1%) and patient (5.7%) cohorts; stratified by gender, a higher proportion of women PSB patients had NSSI than in the university cohort (29.3% vs 9.3%). In the university group, which had milder PSB than the patient cohort, PSB was associated with NSSI (OR=2.11, p<0.001); a significant gender by PSB interaction was found, showing that women with PSB were over four times more likely to have NSSI than men without PSB (OR=4.44, p=0.037). In contrast, PSB severity was not associated with NSSI in PSB patients (OR=1.10, p=0.25). Associations of the addictive dimensions of PSB with NSSI were observed only in the subgroup of university women, in the 'preoccupation' dimension (p<0.001). Conclusions: Our findings highlight gender-specific patterns in the association between PSB and NSSI, suggesting the need for further research and possibly targeted prevention and intervention strategies in women.
Rajasuriya, M.; Chulasiri, P.; Ratnayake, P.; Plevin, D.
Objectives: To evaluate the effectiveness and cultural feasibility of family-supervised disulfiram as a first-line treatment for alcohol use disorder (AUD) in Sri Lanka, and to compare its clinical outcomes with standard therapy delivered at a tertiary psychiatric unit. Design: A single-blind randomised controlled trial, ETAT-RCT (Efficacy of Two Alcohol Treatments), was conducted in a routine clinical setting with three parallel groups: family-supervised disulfiram, a locally developed psychosocial intervention, and routine treatment. Allocation was independently concealed; assessors were blinded. Analyses followed an intention-to-treat approach using repeated-measures ANOVA (group × time). This paper reports the disulfiram (test) versus routine treatment (control) comparison; the psychosocial intervention will be reported separately. Setting: University Psychiatry Unit, National Hospital of Sri Lanka, Colombo (UPU, NHSLC). Participants: Patients aged ≥14 years with AUD presenting to the unit were recruited consecutively without inducements. The planned allocation ratio was 1:1:1 with 31 participants per arm; key exclusions were lifetime psychotic disorder and current contraindication to disulfiram. Randomisation: Participants were randomised into each treatment arm using an independent concealed paper-based allocation system. Intervention: (1) family-supervised disulfiram, with psychoeducation/support only (DT arm); (2) a locally developed denormalisation-focused psychosocial programme (PT arm); and (3) standard therapy (motivational/cognitive/behavioural input; naltrexone permitted; no disulfiram/denormalisation) (ST arm). Outcome measures: The primary outcome was the Alcohol Use Disorders Identification Test (AUDIT) score at 12 months.
Key secondary outcomes were past-30-day alcohol use via Timeline Follow-Back (TLFB); alcohol biomarkers [ALT (alanine aminotransferase), γ-GT (gamma-glutamyl transferase), MCV (mean corpuscular volume)]; locally developed measures of addiction-relevant cognitive, affective, and behavioural factors [AARSU (Attitude Assessment Related to Substance Use), BARSU (Behaviour Assessment Related to Substance Use)]; and the Quality of Life Enjoyment and Satisfaction Questionnaire Short Form (Q-LES-Q-SF). Outcomes were assessed at baseline, 6, and 12 months. Results: Participants in DT (n=33) and ST (n=38) were comparable at baseline. Both groups showed clinically and statistically significant improvement in AUDIT scores over 12 months (DT: F=39.90, p<0.001; ST: F=49.90, p<0.001), with no group × time interaction (F<0.001, p=0.98). Biomarkers and AARSU, and to a lesser degree BARSU and Q-LES-Q-SF, mirrored the AUDIT pattern. TLFB did not change significantly over time in either arm (p>0.05). In moderator analyses, improvement in AUDIT was not moderated by baseline motivation (F=0.20, p=0.89) but was moderated by baseline AUD severity (F=7.70, p=0.007). No serious adverse events were attributed to disulfiram. Adherence to supervised dosing was generally high during periods of supervision but intermittent overall. Conclusions: In this pilot RCT, family-supervised disulfiram achieved 12-month outcomes comparable to standard therapy in a tertiary Sri Lankan setting. Improvements were independent of baseline motivation and varied by baseline AUD severity. These findings may support family-supervised disulfiram as a culturally feasible first-line option in Sri Lanka; larger, adequately powered multicentre trials are warranted to confirm effectiveness and scalability. Trial registration: SLCTR/2014/021
Wei, X.; Xao, X.; Hou, J.; Wang, Q.
Background & Aims: Accurate assessment of clinical malnutrition using anthropometric and functional indicators could improve the care of elderly trauma patients in intensive care units (ICUs). This study aimed to develop an AI-driven malnutrition assessment toolbox based on a minimal set of clinically feasible indicators. Methods: Multiple machine learning models, including logistic regression, support vector machines, k-nearest neighbors, decision trees, random forests, XGBoost, and neural-network-based ensemble models, were developed using different indicator configurations from a clinically collected patient dataset. Models were trained using baseline and longitudinal measurements to predict malnutrition risk. SHAP analysis was used to interpret the importance of selected indicators. Results: Baseline (Day 1) data alone did not provide a reliable prediction, whereas longitudinal measurements substantially improved performance. Models based on a minimal indicator set, including bilateral mid-upper arm circumference, calf circumference, and key static variables, outperformed models using the full indicator set. Tree-based methods consistently outperformed linear and distance-based models, with the three-time-point XGBoost achieving the best individual performance. Neural-network-based ensemble models further improved predictive stability. The best overall performance was achieved by the ensemble model using the minimal indicator set from Day 1 and Day 3. SHAP analysis confirmed the importance of the selected indicators. Conclusions: This AI-driven toolbox provides an efficient and clinically feasible approach for early malnutrition assessment in elderly trauma patients in the ICU. Its strong performance with a minimal indicator set supports its potential for integration into clinical workflows and future digital twin systems for intelligent nutritional management.
Usuzaki, T.; Matsunbo, E.; Inamori, R.
Despite the remarkable progress of artificial intelligence represented by large language models, how AI technologies can contribute to the construction of evidence in evidence-based medicine (EBM) remains an overlooked issue. What is needed now is an AI that is compatible with EBM. In the present paper, we propose an example analysis that may contribute to this approach using a variable Vision Transformer.
Kheirbakhsh, R.; Mathur, P.; Lawlor, A.
Multimodal machine learning leverages complementary information from diverse data sources and has shown strong promise in medical imaging, where multimodal data is critical for clinical decision making. In glioma grading, integrating MRI modalities with clinical data can improve diagnostic accuracy, yet systematic comparisons of fusion strategies remain limited. This study evaluates early, intermediate, and late fusion approaches, addressing the question: How does the inclusion of clinical data alongside MRI modalities influence grading performance? To assess modality contributions, we design adaptable fusion layers and employ interpretability techniques, including attention-based analysis. Our results show that incorporating clinical data consistently outperforms unimodal and MRI-only baselines, with intermediate fusion yielding the most reliable gains. Beyond accuracy, the framework reveals how MRI and clinical features jointly shape predictions, underscoring the importance of both fusion design and interpretability for clinical adoption.
Pozo, M.; Pape, A.; Locke, B.; Pettine, W. W.
Timely identification of intensive care unit (ICU) patients likely to exit the unit can support anticipatory workflows such as chart review, eligibility screening, and patient outreach prior to transfer. Most ICU discharge prediction studies report discrimination and calibration, but these metrics do not quantify the decision consequences of acting on predictions. Using adult ICU admissions from MIMIC-IV, we represented each ICU stay as a sequence of daily clinical summaries and trained logistic regression, random forest, and XGBoost models to predict next-day ICU transfer. Models achieved ROC AUC of 0.80-0.84 with differing calibration. We evaluated decision utility using decision curve analysis (DCA), where positive predictions trigger proactive review. Across thresholds, model-guided strategies outperformed review-all, review-none, and a simple clinical rule. To translate net benefit into implementable operations, we modeled a clinical trial recruitment workflow with an 8-hour daily time constraint, incorporating chart review and consent effort. At a feasible operating threshold (0.23), the model flagged ~23 charts/day and yielded ~1.23 enrollments/day under conservative eligibility and consent assumptions. These results demonstrate that DCA provides a transparent framework for determining when ICU transfer predictions are worth using and how thresholds should be selected to align with real-world workflow constraints. Data and Code Availability: This research was conducted using data from MIMIC-IV. Researchers can request access via PhysioNet. Implementation code is available upon request.
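The net-benefit calculation at the heart of decision curve analysis is simple enough to sketch directly: true positives are credited, and false positives are penalized at the odds implied by the threshold probability. Outcomes and probabilities below are invented, not MIMIC-IV data.

```python
# Minimal decision-curve-analysis sketch: net benefit of a model-guided
# "review" policy versus reviewing everyone. Data are invented.

def net_benefit(y_true, y_prob, threshold):
    """Net benefit at `threshold`: TP/n - (pt/(1-pt)) * FP/n."""
    n = len(y_true)
    act = [p >= threshold for p in y_prob]
    tp = sum(a and t == 1 for a, t in zip(act, y_true))
    fp = sum(a and t == 0 for a, t in zip(act, y_true))
    w = threshold / (1 - threshold)
    return tp / n - w * fp / n

y = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]  # 1 = transferred next day (invented)
p = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2, 0.1, 0.4, 0.5, 0.15]

t = 0.5  # illustrative threshold (the paper operates at 0.23)
prevalence = sum(y) / len(y)
nb_model = net_benefit(y, p, t)
nb_all = prevalence - (1 - prevalence) * t / (1 - t)  # review everyone
print(f"model {nb_model:.2f} vs review-all {nb_all:.2f}")
```

Plotting `net_benefit` across a range of thresholds, against the review-all and review-none (always 0) baselines, yields the decision curve the abstract refers to.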
Van Oyen, C.; Mirza-Haq, N.
MedSafe-Dx (v0) introduces a new safety-focused benchmark for evaluating large language models in clinical diagnostic decision support, using a filtered subset of the DDx Plus dataset (N=250). MedSafe-Dx evaluates three dimensions: escalation sensitivity, avoidance of false reassurance, and calibration of uncertainty. Models were tasked with providing a ranked differential (ICD-10), an escalation decision (Urgent vs. Routine), and a confidence flag. Performance was measured via a "Safety Pass Rate," a composite metric penalizing three hard failure modes: missed escalations of life-threatening conditions, overconfident incorrect diagnoses, and unsafe reassurance in ambiguous cases. Evaluation of eleven models revealed a significant disconnect between diagnostic recall and safety. GPT-5.2 achieved the highest Safety Pass Rate (97.6%), while several models exhibited high rates of missed escalations or unsafe reassurance. MedSafe-Dx provides a robust stress test for identifying high-risk failure modes in diagnostic decision support and shows that high diagnostic accuracy does not guarantee clinical safety. While the benchmark is currently limited by synthetic data and proxy labels, it provides a reproducible, auditable framework for testing AI behavior before clinical deployment. Our findings suggest that interventions such as safety-focused prompting and reasoning-token budgets could be essential components for the safe deployment of LLMs in clinical workflows.
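A composite pass rate of this kind can be sketched in a few lines: a case passes only if none of the three hard failure modes fired. The case records and field names below are invented to illustrate the structure, not taken from the benchmark's actual schema.

```python
# Sketch of a composite "Safety Pass Rate" in the spirit described:
# any hard failure on a case fails the whole case. Records are invented.

cases = [
    {"missed_escalation": False, "overconfident_wrong": False, "unsafe_reassurance": False},
    {"missed_escalation": True,  "overconfident_wrong": False, "unsafe_reassurance": False},
    {"missed_escalation": False, "overconfident_wrong": True,  "unsafe_reassurance": False},
    {"missed_escalation": False, "overconfident_wrong": False, "unsafe_reassurance": False},
]

def safety_pass_rate(records):
    """Fraction of cases with no hard failure mode triggered."""
    passed = sum(not any(r.values()) for r in records)
    return passed / len(records)

print(f"Safety Pass Rate: {safety_pass_rate(cases):.1%}")  # 50.0%
```

The design choice worth noting is the hard-AND semantics: a model with excellent top-1 diagnostic recall can still score poorly here, which is precisely the recall-vs-safety disconnect the abstract reports.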
Rubiera, M.; Bendszus, M.; Leker, R. R.; Hilbert, A.; Werren, I.; Lopez-Ramos, L. M.; Ayesta, M.; Nguyen, T. N. Q.; Bonekamp, S.; Sala, V.; Jubran, H.; Meza, C.; Shalabi, F.; Schwartzmann, Y.; Cano, D.; von Tottleben, M.; Kelleher, J.; Frey, D.
Introduction: Despite the proven benefits of reperfusion therapies in acute ischemic stroke, treatment decisions in the hyperacute phase remain complex and are rarely supported by individualized outcome predictions. Artificial intelligence (AI)-based clinical decision support systems (CDSS) offer potential real-time prognostic estimates, but prospective evidence of their feasibility and performance in routine clinical workflows is limited. Our aim is to prospectively evaluate real-time feasibility, usability, and predictive performance of an AI-based CDSS (VALIDATE-CDSS) for individualized outcome prediction in acute stroke care. Methods and analysis: Prospective, multicenter, observational study enrolling consecutive patients with acute ischemic stroke presenting to three tertiary stroke centers. Clinical management will follow standard practice at the discretion of treating physicians. In parallel, a dedicated researcher will collect patient data in real time and input them into the VALIDATE-CDSS using a mobile application, operating in shadow mode without influencing clinical decisions. The system will generate individualized predictions of 3-month functional outcome (modified Rankin Scale) for four treatment strategies (intravenous thrombolysis, endovascular thrombectomy, combined therapy, or no reperfusion) at three sequential time points: baseline clinical data, non-contrast CT, and CT angiography. The primary outcome is the real-world feasibility and usability of the VALIDATE-CDSS in the hyperacute stroke workflow. Secondary outcomes include predictive performance, agreement between model-suggested and actual treatments, incremental value with increasing data availability, and assessment of potential bias across predefined subgroups.
This study will provide prospective real-world evidence on the implementation and clinical potential of AI-based decision support for personalized treatment selection in acute ischemic stroke. Ethics and dissemination: Patient enrollment began after approval from the ethics committees of all participating centers. Results will be disseminated through peer-reviewed open-access journals and conference presentations. Following open science principles, anonymized data and metadata will be made publicly available in the Zenodo repository upon study completion. Trial registration: ClinicalTrials.gov (NCT05622539).
Matthewman, J.; Denaxas, S.; Langan, S.; Painter, J. L.; Bate, A.
Show abstract
Objectives: Large language models (LLMs) have shown promise in creating clinical codelists for research purposes, a time-consuming task requiring expert domain knowledge. Here, we evaluate the performance and assess failure modes of a retrieval augmented generation (RAG) approach to creating clinical codelists for the large and complex medical terminology used by the Clinical Practice Research Datalink (CPRD). Materials & Methods: We set up a RAG system using a database of word embeddings of the medical terminology, which we created using a general-purpose word embedding model (gemini-embedding). We developed 7 reference codelists presenting different challenges and tagged required and optional codes. We ran 168 evaluations (7 codelists, 2 different database subsets, 4 models, 3 epochs each). Scoring was based on the omission of required codes and the inclusion of irrelevant codes. We used model-grading (i.e., grading by another LLM with the reference codelists provided as context) to evaluate the output codelists (a score of 0% being all incorrect and 100% being all correct). Results: We saw varying accuracy across models and codelists, with Gemini 3 Pro (score 43%) generally performing better than Claude Sonnet 4.6 (36%) and Gemini 3 Flash, and OpenAI GPT 5.2 performing worst (14%). Models performed better with shorter target codelists (e.g., Eosinophilic esophagitis with four codes, and Hidradenitis suppurativa with 14 codes). Conversely, all models consistently failed to produce a complete Wrist fracture codelist (with 214 required codes). We further present evaluation summaries and failure-mode evaluations produced by parsing LLM chat logs. Discussion: Besides demonstrating that a single-shot RAG approach is currently not suitable for codelist generation, we demonstrate failure modes including hallucinations, retrieval failures, and generation failures where retrieved codes are not used.
Conclusions: Our findings suggest that while RAG systems using current frontier LLMs may create correct clinical codelists in some cases, they still struggle with large and complex terminologies and codelists with a large number of codes. The failure modes we highlight can inform the creation of future workflows that avoid these failures.
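The retrieval step in such a RAG pipeline can be sketched as a nearest-neighbour search over term embeddings: embed the target condition, rank terminology entries by cosine similarity, and hand the top-k candidates to the LLM for codelist assembly. The vectors below are toy 3-dimensional stand-ins, not real gemini-embedding output, and the term strings are illustrative.

```python
# Toy retrieval step for a codelist RAG pipeline: cosine similarity over
# a tiny embedded terminology. Embeddings are invented 3-d vectors.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

term_db = {
    "Wrist fracture NOS":       [0.9, 0.1, 0.0],
    "Fracture of radius":       [0.8, 0.3, 0.1],
    "Eosinophilic esophagitis": [0.0, 0.2, 0.9],
}
query = [0.85, 0.2, 0.05]  # toy embedding of the query "wrist fracture"

top = sorted(term_db, key=lambda t: cosine(query, term_db[t]), reverse=True)[:2]
print(top)
```

The failure modes the abstract names map cleanly onto this pipeline: retrieval failures (a required code never reaches the top-k), generation failures (a retrieved code is dropped by the LLM), and hallucinations (the LLM emits a code absent from `term_db`).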
Zhu, L.; Wang, W.; Liang, Z.; Tan, W.; Chen, B.; Lin, X.; Wu, Z.; Yu, H.; Li, X.; Jiao, J.; He, S.; Dai, G.; Niu, J.; Zhong, Y.; Hua, W.; Chan, N. Y.; Lu, L.; Wing, Y. K.; Ma, X.; Fan, L.
The rapid rise of large language models (LLMs) and foundation models has accelerated efforts to build artificial intelligence (AI) agents for mental health assessment, triage, psychotherapy support and clinical decision assistance. Yet a gap persists between healthcare and AI-focused work: while both communities use the language of "agents," clinical research largely describes monolithic chatbots, whereas AI studies emphasize agentic properties such as autonomous planning, multiagent coordination, tool and database use and integration with multimodal mental health data streams. In this Review, we conduct a systematic analysis of mental health AI agent systems from 2023 to 2025 using a six-dimensional audit framework: (i) system type (base model lineage, interface modality and workflow composition, from rule-based tools to role-aware multi-agent foundation-model systems), (ii) data scope (modalities and provenance, from elicited self-report and chatbot dialogues to electronic health records, biosensing and synthetic corpora), (iii) mental health focus (mapped to ICD-11 diagnostic groupings), (iv) demographics (age strata, geography and sex representation), (v) downstream tasks (screening/triage, clinical decision support, therapeutic interventions, documentation, ethical-legal support and education/simulation) and (vi) evaluation types (automated metrics, language quality benchmarks, safety stress tests, expert review and clinician or patient involvement). 
Across this corpus, we find that most systems (1) concentrate on depression, anxiety and suicidality, with sparse coverage of severe mental illness, neurocognitive disorders, substance use and complex comorbidity; (2) rely heavily on text-based self-report rather than clinically verified longitudinal data or genuinely multimodal inputs; (3) are implemented as single-agent chatbots powered by general-purpose LLMs rather than role-structured, workflow-integrated pipelines; and (4) are evaluated primarily via offline metrics or vignette-based scenarios, with few prospective, clinician- or patient-in-the-loop studies. At the same time, an emerging class of agentic systems assigns foundation models explicit roles as planners, retrieval agents, safety auditors or supervisors coordinating other models and tools. These multiagent, tool-augmented workflows promise personalization, safety monitoring and greater transparency, but they also introduce new risks around reliability, bias amplification, privacy, regulatory accountability and the blurring of clinical versus non-clinical roles. We conclude by outlining priorities for the next generation of mental health AI agents: clinically grounded, role-aware multi-agent architectures; transparent and privacy-preserving use of clinical and elicited data; demographic and cultural broadening beyond predominantly Western adult samples; and evaluation pipelines that progress from offline benchmarks to longitudinal, real-world studies with routine safety auditing and clear governance of responsibilities between agents and human clinicians.
de Boer, S.; Häntze, H.; Ziegelmayer, S.; van Ginneken, B.; Prokop, M.; Bressem, K. K.; Hering, A.
Background: Medical imaging, especially computed tomography and magnetic resonance imaging, is essential in the clinical care of patients with renal cell carcinoma (RCC). Artificial intelligence (AI) research into computer-aided diagnosis, staging, and treatment planning needs curated and annotated datasets. Across the literature, The Cancer Genome Atlas (TCGA) datasets are widely used for model training and validation. However, re-annotation is often necessary due to limited access to public annotations, raising entry barriers and hindering comparison with prior work. Methods: We screened 1915 CT scans from three TCGA-RCC databases and employed a segmentation model to annotate kidney lesions. After a metadata-based exclusion step, we hosted a reader study with all papillary (n=56), chromophobe (n=27), and 200 randomly selected clear cell RCC cases. Two students quality-checked and corrected the data as well as annotated tumors and cysts. Uncertain cases were checked by a board-certified radiologist. Results: After data exclusion and quality control, a total of 142 annotated CT scans from 101 patients (26 female, 75 male; mean age 56 years) remained. This includes 95 CTs with clear cell RCC, 29 with papillary RCC, and 18 with chromophobe RCC. Images and voxel-level annotations of kidneys and lesions are open-sourced at https://zenodo.org/records/19630298. Conclusion: By making the annotations open-source, we encourage accessible and reproducible AI research for renal cell carcinoma. We invite other researchers who have previously annotated any of these cohorts to share their annotations.
Bolpagni, M.; Pozza, M.; Gabrielli, S.
Chronic psychological stress contributes to allostatic load and is associated with cardiovascular, metabolic, and mental health disorders. Wearable devices enable continuous, noninvasive monitoring of autonomic signals such as heart rate variability (HRV), creating new opportunities for real-time stress assessment. Large language models (LLMs) are increasingly explored as interfaces for interpreting such data, but it remains unclear whether their predictions reflect physiologically meaningful patterns or rely on superficial heuristics. In this study, we assess whether LLM-derived stress predictions are physiologically coherent and how this varies with model scale. Using a longitudinal wearable dataset collected in naturalistic conditions (35 participants; 5,100 five-minute windows with HRV and contextual features), we obtained stress pseudoprobabilities from three models in the Mistral 3 family (675B, 14B, 3B) via zero-shot prompting. To make model behavior interpretable, we trained surrogate models to approximate LLM outputs and analyzed feature-response relationships using SHAP. Our results indicate that surrogate models closely reproduced LLM predictions (R² up to 0.915; Cohen's κ up to 0.941), enabling high-fidelity characterization of decision patterns and providing a practical framework for auditing the physiological coherence of LLM-derived predictions. Physiological coherence increased with model scale: the largest model exhibited near-complete alignment with established HRV stress responses, together with stable, predominantly monotonic feature effects and a balanced integration of physiological and contextual information. This pattern weakened at smaller scales, with the mid-scale model showing partial alignment and the smallest model displaying reduced stability, greater feature concentration, and more irregular, non-monotonic relationships.
These findings indicate that larger LLMs encode more physiologically consistent representations of stress, whereas smaller models rely on simplified and less stable strategies, and highlight the value of surrogate-based analysis as a practical framework for evaluating LLM behavior in biomedical applications and supporting their responsible integration into wearable health analytics.
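The surrogate-fidelity idea can be sketched with a deliberately simple surrogate: fit an interpretable model (here a one-feature linear fit; the study used richer surrogates analyzed with SHAP) to the LLM's stress pseudoprobabilities and measure how much variance it captures via R². The RMSSD values and pseudoprobabilities below are synthetic.

```python
# Sketch of surrogate-fidelity auditing: approximate LLM stress outputs
# with a simple linear surrogate and report R^2. Data are synthetic.

def fit_line(x, y):
    """Ordinary least-squares fit y ~ a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b

def r_squared(y, y_hat):
    my = sum(y) / len(y)
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Synthetic pattern: lower RMSSD (an HRV index) -> higher predicted stress
rmssd = [20, 30, 40, 50, 60, 70]
llm_stress = [0.85, 0.74, 0.62, 0.49, 0.40, 0.30]  # invented pseudoprobabilities

a, b = fit_line(rmssd, llm_stress)
preds = [a + b * x for x in rmssd]
print(f"surrogate R^2 = {r_squared(llm_stress, preds):.3f}")
```

A high R² means the surrogate is a faithful proxy, so its (inspectable) feature-response shape can stand in for the opaque LLM when checking physiological coherence; a low R² would mean the audit says little about the LLM itself.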
Law, S. Y. R.; Mukadam, N.; Pourhadi, N.; Chaudry, A.; Shiakalli, A.; Rai, U.; Livingston, G.
Objective: To examine whether menopausal women who initiate systemic menopausal hormone therapy (MHT) around menopause (45-60 years old) have a different risk of developing dementia than those not taking MHT. Design: Systematic review and meta-analysis of randomised controlled trials and longitudinal observational studies. Risk of bias was assessed using ROB-2 and ROBINS I-V2. Data sources: MEDLINE, Web of Science, EMBASE, and Cochrane Library to 27 March 2026. Eligibility criteria for selecting studies: Studies which measured dementia or cognitive decline in women who initiated systemic MHT between ages 45-60 or within 5 years of menopause, compared with placebo or no MHT. Authors were contacted for additional details if needed. Main outcome measures: Dementia, Alzheimer's disease (AD), cognitive decline. Results: 10 studies totalling 213,678 participants (189,525 in studies with the primary population). There was no significant increased risk in women with a uterus for all-cause dementia (pooled hazard ratio (HR): 1.12; 95% CI 0.91-1.31, N=78,613, I² = 96.9%), but increased AD risk (HR: 1.14; 95% CI 1.02-1.29, N=134,865, I² = 35.6%). Results were similar in sensitivity analyses including women with or without a uterus. Results for cognitive decline were variable. Conclusions: MHT initiated around the age of menopause should not be prescribed for cognition or dementia prevention. It is not protective against dementia and may increase risk slightly. The magnitude of risk was similar for AD and all-cause dementia, but the latter had larger confidence intervals. Studies which followed up individuals directly, rather than relying on health records, lost people to follow-up. This may account for the difference in cognitive decline outcomes between studies, as people with cognitive impairment and dementia are more likely not to attend. MHT prescribing should balance benefits against risks, including evidence of a small increased dementia risk.
There are few high-quality studies, so further research would inform recommendations. Systematic review registration: PROSPERO CRD420251010663.
What is already known on this topic:
- Menopausal hormone therapy (MHT) is effective for alleviating vasomotor symptoms. Contemporary guidelines recommend treatment for such symptoms be initiated under age 60 and/or within 10 years of menopause onset.
- A large randomised trial on the topic found increased risk of dementia in women initiating MHT after the age of 65.
- It is unknown whether initiating MHT around the age of menopause impacts the risk of dementia or cognitive decline.
What this study adds:
- There was no evidence that taking MHT around the time of menopause decreases the risk of dementia or cognitive impairment.
- It should not be prescribed for these indications.
- We were able to find more studies which examine this question by contacting authors for additional data.
- Initiating MHT in women with a uterus around the age of menopause increased the risk of Alzheimer's disease slightly, by over 10%, and there was a similar but not significant effect in the fewer studies of all-cause dementia. Women with or without a uterus show similar results.
- We found no significant difference in cognitive decline, possibly due to loss to follow-up. This may be because most studies of cognitive decline follow up
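The pooled hazard ratios and I² values above come from a random-effects meta-analysis. A minimal sketch of DerSimonian-Laird pooling of study-level log hazard ratios (the function name and the assumption that study CIs are symmetric on the log scale are ours; the review's exact model may differ):

```python
import numpy as np

def pool_hazard_ratios(hrs, ci_lows, ci_highs):
    """DerSimonian-Laird random-effects pooling of log hazard ratios.

    Study-level 95% CIs are converted to standard errors on the log scale.
    Returns the pooled HR, its 95% CI bounds, and the I^2 statistic (%).
    """
    y = np.log(hrs)                                   # log-HR per study
    se = (np.log(ci_highs) - np.log(ci_lows)) / (2 * 1.96)
    w = 1.0 / se**2                                   # fixed-effect weights
    y_fe = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - y_fe) ** 2)                   # Cochran's Q
    df = len(y) - 1
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)                     # between-study variance
    w_re = 1.0 / (se**2 + tau2)                       # random-effects weights
    y_re = np.sum(w_re * y) / np.sum(w_re)
    se_re = np.sqrt(1.0 / np.sum(w_re))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return (np.exp(y_re),
            np.exp(y_re - 1.96 * se_re),
            np.exp(y_re + 1.96 * se_re),
            i2)
```

With identical studies the pooled HR reproduces the study HR and I² is 0; heterogeneous studies inflate tau² and widen the pooled interval, as the very different I² values for all-cause dementia (96.9%) and AD (35.6%) illustrate.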
Reinosa, R.
Introduction: The translation of biomarkers into binary clinical decisions requires the determination of precise cut-off points. This study validates the TholdStormDX v0.0.1 tool, a mathematical engine that employs Dual Annealing, 2- and 4-parameter logistic fitting, and vectorized Monte Carlo simulations for panel optimization under Boolean OR logic. Methods: The tool was evaluated using datasets from four diagnostic domains (Pulmonary Nodules, Hepatocellular Carcinoma [HCC], Cervical Cancer, and Breast Cancer), along with a prognosis-oriented analytical context (Breast Cancer). Validation followed a strict workflow: characterization and selection of the best individual and combined thresholds in the Training (Train) and Validation (Val) sets, using the Test set in a completely independent manner, solely to assess the model's performance and generalizability. Results: The tool enabled precise derivation of cut-off points for both individual biomarkers and multivariable combinations. Evaluation on the Test set objectively demonstrated in which scenarios a single biomarker outperforms a complex panel, promoting clinical parsimony. For example, in Breast Cancer diagnosis, an individual predictor outperformed the optimized panel (Sensitivity: 0.953 / Specificity: 0.952 in Test); conversely, in Hepatocellular Carcinoma, the multivariable combination showed superior performance compared to the single marker (Sens: 0.707 / Spec: 0.718 in Test). Additionally, the self-auditing system effectively flagged metric degradation when noisy variables were included, preventing potential issues. Conclusion: TholdStormDX v0.0.1 proves to be a robust and transparent bioinformatics platform for deriving clinical thresholds. Its main contribution lies in mitigating local minima and promoting clinical parsimony, enabling researchers to objectively identify when a single biomarker is sufficient and when a panel provides real added value. 
Furthermore, it transforms the problem of biological noise into a safety feature: by systematically warning about algorithmic instability, it prevents overfitting and ensures the clinical viability of medical decisions. Availability: The software is free and distributed under the GNU GPLv3 license. TholdStormDX v0.0.1 is written in Python, and its source code is available at the following GitHub address: https://github.com/roberto117343/TholdStormDX.
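The global thresholding step the abstract names can be sketched with SciPy's `dual_annealing` optimizer. Here the objective is the Youden index for a single-biomarker cut-off; the objective choice and function names are illustrative, not taken from TholdStormDX itself:

```python
import numpy as np
from scipy.optimize import dual_annealing

def best_cutoff(values, labels, seed=0):
    """Find a single-biomarker cut-off by simulated (dual) annealing.

    Maximizes the Youden index (sensitivity + specificity - 1) over the
    observed biomarker range. Returns (cut-off, achieved Youden index).
    """
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels, dtype=bool)

    def neg_youden(theta):
        pred = values >= theta[0]                     # positive call
        sens = np.mean(pred[labels]) if labels.any() else 0.0
        spec = np.mean(~pred[~labels]) if (~labels).any() else 0.0
        return -(sens + spec - 1.0)

    bounds = [(values.min(), values.max())]
    res = dual_annealing(neg_youden, bounds, seed=seed, maxiter=200)
    return float(res.x[0]), -float(res.fun)
```

Because the objective is piecewise constant in the cut-off, local gradient methods stall on flat plateaus; annealing's global sampling is what mitigates the local-minima problem the conclusion highlights.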
Auger, S. D.; Varley, J.; Hargovan, M.; Scott, G.
Background: Current medical large language model (LLM) evaluations largely rely on small collections of cases, whereas rigorous safety testing requires large-scale, diverse, and complex cases with verifiable ground truth. Multiple Sclerosis (MS) provides an ideal evaluation model, with validated diagnostic criteria and numerous paraclinical tests informing differential diagnosis, investigation, and management. Methods: We generated synthetic MS cases with ground-truth labels for diagnosis, localisation, and management. Four frontier LLMs (Gemini 3 Pro/Flash, GPT 5.2/5 mini) were instructed to analyse cases to provide anatomical localisation, differential diagnoses, investigations, and management plans. An automated evaluator compared these outputs to the ground-truth labels. Blinded subspecialty experts validated 70 cases for realism and automated evaluator accuracy. We then evaluated LLM decision-making across 1,000 cases and scaled to 10,000 to characterise rare, catastrophic failures. Results: Subspecialist expert review confirmed 100% synthetic case realism and 99.8% (95% CI 95.5 to 100) automated evaluation accuracy. Across 1,000 generated MS cases, all LLMs successfully included MS in the differential diagnoses for more than 91% of cases. However, diagnostic competence did not associate with treatment safety. Gemini 3 models had low rates of clinically appropriate steroid recommendations (Flash: 7.2%, 95% CI 5.6 to 8.8; Pro: 15.8%, 95% CI 13.6 to 18.1) compared to GPT 5 mini (23.5%, 95% CI 20.8 to 26.1), frequently overlooking contraindications like active infection. OpenAI models inappropriately recommended acute intravenous thrombolysis for MS cases (9.6% GPT 5.2; 6.4% GPT 5 mini) compared to below 1% for Gemini models. Expanded evaluation (to 10,000 cases) probed these errors in detail. 
Thrombolysis was recommended in 10.1% of cases lacking symptom timing information and paradoxically persisted (2.9%) even when symptoms were explicitly documented as more than 14 days old. Conclusion: Automated expert-level evaluation across 10,000 cases characterised artificial intelligence clinical blind spots hitherto invisible to small-scale testing. Massive-scale simulation and automated interrogation should become standard for uncovering serious failures and implementing safety guardrails before clinical deployment exposes patients to risk.
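The accuracy and recommendation rates above carry 95% binomial confidence intervals. A Wilson score interval is one standard choice for proportions near 0 or 1 (an assumption on our part; the paper does not state which interval method it uses):

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion.

    Returns (point estimate, lower bound, upper bound). Unlike the Wald
    interval, it behaves sensibly for proportions near 0 or 1, such as a
    99.8% evaluator accuracy.
    """
    if n == 0:
        return 0.0, 0.0, 1.0
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return p, max(0.0, centre - half), min(1.0, centre + half)
```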
Rodriguez, A. M.; The Pooled Resource Open-Access ALS Clinical Trials Consortium
Standard analysis of amyotrophic lateral sclerosis (ALS) clinical trials evaluates therapeutic efficacy by comparing linear slopes of total ALS Functional Rating Scale (ALSFRS) scores between treatment arms. This approach compresses multidomain ordinal data into a single scalar trajectory, discarding distributional structure. When subgroup-level trends differ in timing or direction, such aggregation can attenuate or eliminate them, a phenomenon known as Simpson's paradox. Here we apply Shannon entropy, computed from item-level score distributions within each ALSFRS functional domain following the framework established in [8], to the PRO-ACT database, stratified by treatment arm (Active: n = 4,581; Placebo: n = 2,931; 19 monthly time points). The entropy trajectories of drug-treated and placebo populations diverge visibly and systematically across all four functional domains (Bulbar, Fine Motor, Gross Motor, Respiratory). In the Fine Motor domain, the placebo population reaches peak entropy at month 8 and reverses, while the active population does not peak until month 13, a five-month delay in the population's transit toward functional loss. This divergence is model-independent: it is present in the raw Shannon entropy trajectories before any dynamical model is applied. A permutation test shuffling patient-level arm labels (n = 1,000 permutations) confirms that the total integrated absolute divergence across all four domains exceeds the null distribution at p < 0.001 (observed: 4.48; null: 2.03 ± 0.33; 7.5 standard deviations above the null mean), with Fine Motor (p = 0.001) and Respiratory (p < 0.001) individually significant. The quantity that differs between arms, the shape and timing of the population's distributional evolution, does not exist as a measurable quantity in the total-score linear-slope framework used to evaluate these trials. 
Whether this signal reflects genuine treatment effects, compositional artifacts from pooling heterogeneous trials, or both cannot be determined from the anonymized public database alone. What can be determined is that the standard ALS clinical trial endpoint rests on an implicit assumption: that the distributional information it discards is uninformative. The present results demonstrate empirically that this assumption is false.
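The core quantities, per-domain Shannon entropy of the item-level score distribution at each time point and a label-shuffling permutation null for the integrated absolute divergence, can be sketched as follows (data shapes and function names are illustrative, not the paper's code):

```python
import numpy as np

def domain_entropy(scores, levels=5):
    """Shannon entropy (bits) of the distribution of ordinal scores
    0..levels-1 across patients at one time point."""
    counts = np.bincount(np.asarray(scores), minlength=levels)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def permutation_divergence(active, placebo, rng, n_perm=1000):
    """Integrated absolute divergence between entropy trajectories,
    with a null built by shuffling patient-level arm labels.

    `active` and `placebo` are lists of per-patient score sequences
    (one ordinal score per monthly time point)."""
    def trajectory(arm):
        arr = np.asarray(arm)                     # patients x time points
        return np.array([domain_entropy(arr[:, t])
                         for t in range(arr.shape[1])])

    observed = np.abs(trajectory(active) - trajectory(placebo)).sum()
    pooled = np.asarray(list(active) + list(placebo))
    n_a = len(active)
    null = np.empty(n_perm)
    for i in range(n_perm):
        idx = rng.permutation(len(pooled))        # shuffle arm labels
        null[i] = np.abs(trajectory(pooled[idx[:n_a]]) -
                         trajectory(pooled[idx[n_a:]])).sum()
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, null, p_value
```

The entropy trajectory is computed per functional domain; summing the observed divergences across the four domains and comparing against the pooled null reproduces the structure of the reported test.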
Sun, S.; Cai, C. X.; Fan, R.; You, S.; Tran, D.; Rao, P. K.; Suchard, M. A.; Wang, Y.; Lee, C. S.; Lee, A. Y.; Zhang, L.
Multimodal learning has the potential to improve clinical prediction by integrating complementary data sources, but the incremental value of imaging beyond structured electronic health record (EHR) data remains unclear in real-world settings. We developed a multimodal survival modeling framework integrating optical coherence tomography (OCT) and EHR data to predict time to visual improvement in patients with diabetic macular edema (DME), and evaluated how different ophthalmic foundation model representations contribute to prognostic performance. In a retrospective cohort of 973 patients (1,450 eyes) receiving anti-vascular endothelial growth factor therapy, we compared multimodal models combining 22,227 EHR variables with 196,402 OCT images, with OCT embeddings derived from three ophthalmic foundation models (RETFound, EyeCLIP, and VisionFM). The EHR-only model showed minimal prognostic discrimination (C-index 0.50 [95% CI, 0.45-0.55]). Incorporating OCT improved performance, with the magnitude of improvement depending on the representation. EHR+RETFound achieved the strongest performance (C-index 0.59 [0.54-0.65]), followed by EHR+EyeCLIP (0.57 [0.52-0.62]) and EHR+VisionFM (0.56 [0.51-0.61]). Multimodal models, particularly EHR+RETFound, demonstrated improved risk stratification with clearer separation of Kaplan-Meier curves. Partial information decomposition revealed that prognostic information was dominated by modality-specific contributions, with OCT and EHR providing largely distinct signals and minimal shared information. The magnitude of OCT-specific contribution varied across foundation models and aligned with observed performance differences. These findings indicate that OCT provides complementary prognostic value beyond structured clinical data, but gains are modest and depend strongly on representation choice. 
Our results highlight both the promise of multimodal modeling for personalized prognosis and the need for rigorous, context-specific evaluation of foundation models in real-world clinical settings.
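The C-index figures above measure rank concordance between a model's risk score and observed time to the event (here, visual improvement), with 0.5 indicating no discrimination. A minimal Harrell's C-index sketch, assuming higher scores predict earlier events and ignoring the censoring-weighting refinements a production implementation would add:

```python
import numpy as np

def concordance_index(times, events, risk_scores):
    """Harrell's C-index.

    A pair (i, j) is comparable when i has an observed event strictly
    before j's time. Concordant pairs (higher score, earlier event)
    count 1, tied scores count 0.5. Returns 0.5 if no pair is comparable.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    risk = np.asarray(risk_scores, dtype=float)
    num, den = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:   # comparable pair
                den += 1
                if risk[i] > risk[j]:
                    num += 1.0
                elif risk[i] == risk[j]:
                    num += 0.5
    return num / den if den else 0.5
```

An EHR-only C-index of 0.50 corresponds to risk scores whose ordering is uninformative about event timing; the gain to 0.59 with RETFound embeddings means a modest but real excess of concordant over discordant pairs.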